{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "copyright",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "# Copyright 2021 Google LLC\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "title",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "# Vertex SDK: AutoML training tabular binary classification model for batch prediction\n",
    "\n",
    "<table align=\"left\">\n",
    "  <td>\n",
    "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/community/sdk/sdk_automl_tabular_binary_classification_batch.ipynb\">\n",
    "      <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
    "    </a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a href=\"https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/community/sdk/sdk_automl_tabular_binary_classification_batch.ipynb\">\n",
    "      <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
    "      View on GitHub\n",
    "    </a>\n",
    "  </td>\n",
    "</table>\n",
    "<br/><br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "overview:automl",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "## Overview\n",
    "\n",
    "\n",
    "This tutorial demonstrates how to use the Vertex SDK to create tabular binary classification models and do batch prediction using Google Cloud's [AutoML](https://cloud.google.com/vertex-ai/docs/start/automl-users)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dataset:bank,lbn",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "### Dataset\n",
    "\n",
    "The dataset used for this tutorial is the [Bank Marketing](gs://cloud-ml-tables-data/bank-marketing.csv). This dataset does not require any feature engineering. The version of the dataset you will use in this tutorial is stored in a public Cloud Storage bucket."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "objective:automl,training,batch_prediction",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "### Objective\n",
    "\n",
    "In this tutorial, you create an AutoML tabular binary classification model from a Python script, and then do a batch prediction using the Vertex SDK. You can alternatively create and deploy models using the `gcloud` command-line tool or online using the Google Cloud Console.\n",
    "\n",
    "The steps performed include:\n",
    "\n",
    "- Create a Vertex `Dataset` resource.\n",
    "- Train the model.\n",
    "- View the model evaluation.\n",
    "- Make a batch prediction.\n",
    "\n",
    "There is one key difference between using batch prediction and using online prediction:\n",
    "\n",
    "* Prediction Service: Does an on-demand prediction for the entire set of instances (i.e., one or more data items) and returns the results in real-time.\n",
    "\n",
    "* Batch Prediction Service: Does a queued (batch) prediction for the entire set of instances in the background and stores the results in a Cloud Storage bucket when ready."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "costs",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "### Costs\n",
    "\n",
    "This tutorial uses billable components of Google Cloud (GCP):\n",
    "\n",
    "* Vertex AI\n",
    "* Cloud Storage\n",
    "\n",
    "Learn about [Vertex AI\n",
    "pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage\n",
    "pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n",
    "Calculator](https://cloud.google.com/products/calculator/)\n",
    "to generate a cost estimate based on your projected usage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "install_aip",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "## Installation\n",
    "\n",
    "Install the latest version of Vertex SDK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "install_aip",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "\n",
    "\n",
    "# Google Cloud Notebook\n",
    "if os.path.exists(\"/opt/deeplearning/metadata/env_version\"):\n",
    "    USER_FLAG = '--user'\n",
    "else:\n",
    "    USER_FLAG = ''\n",
    "\n",
    "! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "install_storage",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "Install the latest GA version of *google-cloud-storage* library as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "install_storage",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "! pip3 install -U google-cloud-storage $USER_FLAG"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "restart",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "### Restart the kernel\n",
    "\n",
    "Once you've installed the Vertex SDK and Google *cloud-storage*, you need to restart the notebook kernel so it can find the packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "restart",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "if not os.getenv(\"IS_TESTING\"):\n",
    "    # Automatically restart kernel after installs\n",
    "    import IPython\n",
    "    app = IPython.Application.instance()\n",
    "    app.kernel.do_shutdown(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "before_you_begin",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "## Before you begin\n",
    "\n",
    "### GPU runtime\n",
    "\n",
    "*Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select* **Runtime > Change Runtime Type > GPU**\n",
    "\n",
    "### Set up your Google Cloud project\n",
    "\n",
    "**The following steps are required, regardless of your notebook environment.**\n",
    "\n",
    "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
    "\n",
    "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
    "\n",
    "3. [Enable the Vertex APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n",
    "\n",
    "4. [The Google Cloud SDK](https://cloud.google.com/sdk) is already installed in Google Cloud Notebook.\n",
    "\n",
    "5. Enter your project ID in the cell below. Then run the  cell to make sure the\n",
    "Cloud SDK uses the right project for all the commands in this notebook.\n",
    "\n",
    "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "set_project_id",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "PROJECT_ID = \"[your-project-id]\"  #@param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "autoset_project_id",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "if PROJECT_ID == \"\" or PROJECT_ID is None or PROJECT_ID == \"[your-project-id]\":\n",
    "    # Get your GCP project id from gcloud\n",
    "    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null\n",
    "    PROJECT_ID = shell_output[0]\n",
    "    print(\"Project ID:\", PROJECT_ID)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "set_gcloud_project_id",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "! gcloud config set project $PROJECT_ID"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "region",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "#### Region\n",
    "\n",
    "You can also change the `REGION` variable, which is used for operations\n",
    "throughout the rest of this notebook.  Below are regions supported for Vertex. We recommend that you choose the region closest to you.\n",
    "\n",
    "- Americas: `us-central1`\n",
    "- Europe: `europe-west4`\n",
    "- Asia Pacific: `asia-east1`\n",
    "\n",
    "You may not use a multi-regional bucket for training with Vertex. Not all regions provide support for all Vertex services. For the latest support per region, see the [Vertex locations documentation](https://cloud.google.com/ai-platform-unified/docs/general/locations)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "region",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "REGION = 'us-central1'  #@param {type: \"string\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "timestamp",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "#### Timestamp\n",
    "\n",
    "If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append onto the name of resources which will be created in this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "timestamp",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "\n",
    "TIMESTAMP = datetime.now().strftime(\"%Y%m%d%H%M%S\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gcp_authenticate",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "### Authenticate your Google Cloud account\n",
    "\n",
    "**If you are using Google Cloud Notebook**, your environment is already authenticated. Skip this step.\n",
    "\n",
    "**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via oAuth.\n",
    "\n",
    "**Otherwise**, follow these steps:\n",
    "\n",
    "In the Cloud Console, go to the [Create service account key](https://console.cloud.google.com/apis/credentials/serviceaccountkey) page.\n",
    "\n",
    "**Click Create service account**.\n",
    "\n",
    "In the **Service account name** field, enter a name, and click **Create**.\n",
    "\n",
    "In the **Grant this service account access to project** section, click the Role drop-down list. Type \"Vertex\" into the filter box, and select **Vertex Administrator**. Type \"Storage Object Admin\" into the filter box, and select **Storage Object Admin**.\n",
    "\n",
    "Click Create. A JSON file that contains your key downloads to your local environment.\n",
    "\n",
    "Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gcp_authenticate",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "# If you are running this notebook in Colab, run this cell and follow the\n",
    "# instructions to authenticate your GCP account. This provides access to your\n",
    "# Cloud Storage bucket and lets you submit training jobs and prediction\n",
    "# requests.\n",
    "\n",
    "# If on Google Cloud Notebook, then don't execute this code\n",
    "if not os.path.exists(\"/opt/deeplearning/metadata/env_version\"):\n",
    "    if \"google.colab\" in sys.modules:\n",
    "        from google.colab import auth as google_auth\n",
    "\n",
    "        google_auth.authenticate_user()\n",
    "\n",
    "    # If you are running this notebook locally, replace the string below with the\n",
    "    # path to your service account key and run this cell to authenticate your GCP\n",
    "    # account.\n",
    "    elif not os.getenv(\"IS_TESTING\"):\n",
    "        %env GOOGLE_APPLICATION_CREDENTIALS ''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bucket:mbsdk",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "### Create a Cloud Storage bucket\n",
    "\n",
    "**The following steps are required, regardless of your notebook environment.**\n",
    "\n",
    "When you initialize the Vertex SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.\n",
    "\n",
    "Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bucket",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "BUCKET_NAME = \"gs://[your-bucket-name]\"  #@param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "autoset_bucket",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "if BUCKET_NAME == \"\" or BUCKET_NAME is None or BUCKET_NAME == \"gs://[your-bucket-name]\":\n",
    "    BUCKET_NAME = \"gs://\" + PROJECT_ID + \"aip-\" + TIMESTAMP"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "create_bucket",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "create_bucket",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "! gsutil mb -l $REGION $BUCKET_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "validate_bucket",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "Finally, validate access to your Cloud Storage bucket by examining its contents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "validate_bucket",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "! gsutil ls -al $BUCKET_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "setup_vars",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "### Set up variables\n",
    "\n",
    "Next, set up some variables used throughout the tutorial.\n",
    "### Import libraries and define constants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "import_aip:mbsdk",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "import google.cloud.aiplatform as aip"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "init_aip:mbsdk",
    "repo": "snippets_housekeeping.ipynb"
   },
   "source": [
    "## Initialize Vertex SDK\n",
    "\n",
    "Initialize the Vertex SDK for your project and corresponding bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "init_aip:mbsdk",
    "repo": "snippets_housekeeping.ipynb"
   },
   "outputs": [],
   "source": [
    "aip.init(project=PROJECT_ID, staging_bucket=BUCKET_NAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tutorial_start:automl",
    "repo": "snippets_automl.ipynb"
   },
   "source": [
    "# Tutorial\n",
    "\n",
    "Now you are ready to start creating your own AutoML tabular binary classification model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "create_dataset",
    "repo": "snippets_mbsdk.ipynb"
   },
   "source": [
    "## Create a Dataset Resource\n",
    "\n",
    "First, you create an tabular Dataset resource for the Bank Marketing dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "data_preparation:tabular,u_dataset",
    "repo": "snippets_common.ipynb"
   },
   "source": [
    "### Data preparation\n",
    "\n",
    "The Vertex `Dataset` resource for tabular has a couple of requirements for your tabular data.\n",
    "\n",
    "- Must be in a CSV file or a BigQuery query."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "data_import_format:lbn,u_dataset,csv",
    "repo": "snippets_common.ipynb"
   },
   "source": [
    "#### CSV\n",
    "\n",
    "For tabular binary classification, the CSV file has a few requirements:\n",
    "\n",
    "- The first row must be the heading -- note how this is different from Vision, Video and Language where the requirement is no heading.\n",
    "- All but one column are features.\n",
    "- One column is the label, which you will specify when you subsequently create the training pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "import_file:u_dataset,csv",
    "repo": "snippets_common.ipynb"
   },
   "source": [
    "#### Location of Cloud Storage training data.\n",
    "\n",
    "Now set the variable `IMPORT_FILE` to the location of the CSV index file in Cloud Storage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "import_file:bank,csv,lbn",
    "repo": "snippets_common.ipynb"
   },
   "outputs": [],
   "source": [
    "IMPORT_FILE = 'gs://cloud-ml-tables-data/bank-marketing.csv'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "quick_peek:tabular",
    "repo": "snippets_common.ipynb"
   },
   "source": [
    "#### Quick peek at your data\n",
    "\n",
    "You will use a version of the Bank Marketing dataset that is stored in a public Cloud Storage bucket, using a CSV index file.\n",
    "\n",
    "Start by doing a quick peek at the data. You count the number of examples by counting the number of rows in the CSV index file  (`wc -l`) and then peek at the first few rows.\n",
    "\n",
    "You also need for training to know the heading name of the label column, which is save as `label_column`. For this dataset, it is the last column in the CSV file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "quick_peek:tabular",
    "repo": "snippets_common.ipynb"
   },
   "outputs": [],
   "source": [
    "count = ! gsutil cat $IMPORT_FILE | wc -l\n",
    "print(\"Number of Examples\", int(count[0]))\n",
    "\n",
    "print(\"First 10 rows\")\n",
    "! gsutil cat $IMPORT_FILE | head\n",
    "\n",
    "heading = ! gsutil cat $IMPORT_FILE | head -n1\n",
    "label_column = str(heading).split(',')[-1].split(\"'\")[0]\n",
    "print(\"Label Column Name\", label_column)\n",
    "if label_column is None:\n",
    "    raise Exception(\"label column missing\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "create_dataset:tabular,lbn",
    "repo": "snippets_mbsdk.ipynb"
   },
   "source": [
    "### Create the Dataset\n",
    "\n",
    "Next, create the `Dataset` resource using the `create()` method for the `TabularDataset` class, which takes the following parameters:\n",
    "\n",
    "- `display_name`: The human readable name for the `Dataset` resource.\n",
    "- `gcs_source`: A list of one or more dataset index file to import the data items into the `Dataset` resource.\n",
    "\n",
    "This operation may take several minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "create_dataset:tabular,lbn",
    "repo": "snippets_mbsdk.ipynb"
   },
   "outputs": [],
   "source": [
    "dataset = aip.TabularDataset.create(\n",
    "    display_name=\"Bank Marketing\" + \"_\" + TIMESTAMP,\n",
    "    gcs_source=[IMPORT_FILE]\n",
    ")\n",
    "\n",
    "print(dataset.resource_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "train_automl_model",
    "repo": "snippets_automl.ipynb"
   },
   "source": [
    "## Train the model\n",
    "\n",
    "Now train an AutoML tabular binary classification model using your Vertex `Dataset` resource. To train the model, do the following steps:\n",
    "\n",
    "1. Create an Vertex training pipeline for the `Dataset` resource.\n",
    "2. Execute the pipeline to start the training."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "create_automl_pipeline:tabular,lbn",
    "repo": "snippets_mbsdk.ipynb"
   },
   "source": [
    "### Create and run training pipeline\n",
    "\n",
    "To train an AutoML tabular binary classification model, you perform two steps: 1) create a training pipeline, and 2) run the pipeline.\n",
    "\n",
    "#### Create training pipeline\n",
    "\n",
    "An AutoML training pipeline is created with the `AutoMLTabularTrainingJob` class, with the following parameters:\n",
    "\n",
    "- `display_name`: The human readable name for the `TrainingJob` resource.\n",
    "- `optimization_prediction_type`: The type task to train the model for.\n",
    "  - `classification`: A tabuar classification model.\n",
    "  - `regression`: A tabular regression model.\n",
    "  - `forecasting`: A tabular forecasting model.\n",
    "- `column_transformations`: (Optional): Transformations to apply to the input columns\n",
    "- `optimization_objective`: The optimization objective to minimize or maximize.\n",
    "  - `minimize-log-loss`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "create_automl_pipeline:tabular,lbn",
    "repo": "snippets_mbsdk.ipynb"
   },
   "outputs": [],
   "source": [
    "dag = aip.AutoMLTabularTrainingJob(\n",
    "    display_name=\"bank_\" + TIMESTAMP,\n",
    "    optimization_prediction_type=\"classification\",\n",
    "    optimization_objective=\"minimize-log-loss\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "run_automl_pipeline:tabular",
    "repo": "snippets_mbsdk.ipynb"
   },
   "source": [
    "#### Run the training pipeline\n",
    "\n",
    "Next, you run the DAG to start the training job by invoking the method `run()`, with the following parameters:\n",
    "\n",
    "- `dataset`: The `Dataset` resource to train the model.\n",
    "- `model_display_name`: The human readable name for the trained model.\n",
    "- `target_column`: The name of the column to train as the label.\n",
    "- `training_fraction_split`: The percentage of the dataset to use for training.\n",
    "- `validation_fraction_split`: The percentage of the dataset to use for validation.\n",
    "- `test_fraction_split`: The percentage of the dataset to use for test (holdout data).\n",
    "- `budget_milli_node_hours`: (optional) Maximum training time specified in unit of millihours (1000 = hour).\n",
    "- `disable_early_stopping`: If `True`, training maybe completed before using the entire budget if the service believes it cannot further improve on the model objective measurements.\n",
    "\n",
    "The `run` method when completed returns the `Model` resource.\n",
    "\n",
    "The execution of the training pipeline will take upto 20 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "run_automl_pipeline:tabular",
    "repo": "snippets_mbsdk.ipynb"
   },
   "outputs": [],
   "source": [
    "\n",
    "model = dag.run(\n",
    "    dataset=dataset,\n",
    "    target_column=label_column,\n",
    "    model_display_name=\"bank_\" + TIMESTAMP,\n",
    "    training_fraction_split=0.6,\n",
    "    validation_fraction_split=0.2,\n",
    "    test_fraction_split=0.2,\n",
    "    budget_milli_node_hours=1000,\n",
    "    disable_early_stopping=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "deploy:batch_prediction",
    "repo": "snippets_common.ipynb"
   },
   "source": [
    "## Model deployment for batch prediction\n",
    "\n",
    "Now deploy the trained Vertex `Model` resource you created for batch prediction. This differs from deploying a `Model` resource for online prediction.\n",
    "\n",
    "For online prediction, you:\n",
    "\n",
    "1. Create an `Endpoint` resource for deploying the `Model` resource to.\n",
    "\n",
    "2. Deploy the `Model` resource to the `Endpoint` resource.\n",
    "\n",
    "3. Make online prediction requests to the `Endpoint` resource.\n",
    "\n",
    "For batch-prediction, you:\n",
    "\n",
    "1. Create a batch prediction job.\n",
    "\n",
    "2. The job service will provision resources for the batch prediction request.\n",
    "\n",
    "3. The results of the batch prediction request are returned to the caller.\n",
    "\n",
    "4. The job service will unprovision the resoures for the batch prediction request."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "make_prediction",
    "repo": "snippets_common.ipynb"
   },
   "source": [
    "## Make a batch prediction request\n",
    "\n",
    "Now do a batch prediction to your deployed model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "make_test_items:automl,batch_prediction",
    "repo": "snippets_automl.ipynb"
   },
   "source": [
    "### Make test items\n",
    "\n",
    "You will use synthetic data as a test data items. Don't be concerned that we are using synthetic data -- we just want to demonstrate how to make a prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "make_test_items:automl,tabular,bank,csv",
    "repo": "snippets_automl.ipynb"
   },
   "outputs": [],
   "source": [
    "HEADING = \"Age,Job,MaritalStatus,Education,Default,Balance,Housing,Loan,Contact,Day,Month,Duration,Campaign,PDays,Previous,POutcome,Deposit\"\n",
    "INSTANCE_1 = \"58,managment,married,teritary,no,2143,yes,no,unknown,5,may,261,1,-1,0, unknown\"\n",
    "INSTANCE_2 = \"44,technician,single,secondary,no,39,yes,no,unknown,5,may,151,1,-1,0,unknown\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "make_batch_file:automl,tabular",
    "repo": "snippets_automl.ipynb"
   },
   "source": [
    "### Make the batch input file\n",
    "\n",
    "Now make a batch input file, which you will store in your local Cloud Storage bucket. Unlike image, video and text, the batch input file for tabular is only supported for CSV. For CSV file, you make:\n",
    "\n",
    "- The first line is the heading with the feature (fields) heading names.\n",
    "- Each remaining line is a separate prediction request with the corresponding feature values.\n",
    "\n",
    "For example:\n",
    "\n",
    "    \"feature_1\", \"feature_2\". ...\n",
    "    value_1, value_2, ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "make_batch_file:automl,tabular",
    "repo": "snippets_automl.ipynb"
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "gcs_input_uri = BUCKET_NAME + '/test.csv'\n",
    "with tf.io.gfile.GFile(gcs_input_uri, 'w') as f:\n",
    "    f.write(HEADING + '\\n')\n",
    "    f.write(str(INSTANCE_1) + '\\n')\n",
    "    f.write(str(INSTANCE_2) + '\\n')\n",
    "\n",
    "print(gcs_input_uri)\n",
    "! gsutil cat $gcs_input_uri"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "make_batch_request:mbsdk,automl",
    "repo": "snippets_mbsdk.ipynb"
   },
   "source": [
    "### Make the batch prediction request\n",
    "\n",
    "Now that your `Model` resource is trained, you can make a batch prediction by invoking the `batch_request()` method, with the following parameters:\n",
    "\n",
    "- `job_display_name`: The human readable name for the batch prediction job.\n",
    "- `gcs_source`: A list of one or more batch request input files.\n",
    "- `gcs_destination_prefix`: The Cloud Storage location for storing the batch prediction resuls.\n",
    "- `sync`: If set to `True`, the call will block while waiting for the asynchronous batch job to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "make_batch_request:mbsdk,automl",
    "repo": "snippets_mbsdk.ipynb"
   },
   "outputs": [],
   "source": [
    "batch_predict_job = model.batch_predict(\n",
    "    job_display_name=\"$(DATASET_ALIAS)_\" + TIMESTAMP,\n",
    "    gcs_source=gcs_input_uri,\n",
    "    gcs_destination_prefix=BUCKET_NAME,\n",
    "    sync=False\n",
    ")\n",
    "\n",
    "print(batch_predict_job)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wait_batch_job:mbsdk",
    "repo": "snippets_mbsdk.ipynb"
   },
   "source": [
    "### Wait for completion of batch prediction job\n",
    "\n",
    "Next, wait for the batch job to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wait_batch_job:mbsdk",
    "repo": "snippets_mbsdk.ipynb"
   },
   "outputs": [],
   "source": [
    "batch_predict_job.wait()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "get_batch_prediction:mbsdk,lbn",
    "repo": "snippets_mbsdk.ipynb"
   },
   "source": [
    "### Get the predictions\n",
    "\n",
    "Next, get the results from the completed batch prediction job.\n",
    "\n",
    "The results are written to the Cloud Storage output bucket you specified in the batch prediction request. You call the method `iter_outputs()` to get a list of each Cloud Storage file generated with the results. Each file contains one or more prediction requests in a JSON format:\n",
    "\n",
    "- `content`: The prediction request.\n",
    "- `prediction`: The prediction response.\n",
    "  - `ids`: The internal assigned unique identifiers for each prediction request.\n",
    "  - `displayNames`: The class names for each class label.\n",
    "  - `confidences`: The predicted confidence, between 0 and 1, per class label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "get_batch_prediction:mbsdk,csv",
    "repo": "snippets_mbsdk.ipynb"
   },
   "outputs": [],
   "source": [
    "bp_iter_outputs = batch_predict_job.iter_outputs()\n",
    "\n",
    "prediction_results = list()\n",
    "for blob in bp_iter_outputs:\n",
    "    if blob.name.split(\"/\")[-1].startswith(\"prediction\"):\n",
    "        prediction_results.append(blob.name)\n",
    "\n",
    "tags = list()\n",
    "for prediction_result in prediction_results:\n",
    "    gfile_name = f\"gs://{bp_iter_outputs.bucket.name}/{prediction_result}\"\n",
    "    with tf.io.gfile.GFile(name=gfile_name, mode=\"r\") as gfile:\n",
    "        for line in gfile.readlines():\n",
    "            print(line)\n",
    "            break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cleanup:mbsdk",
    "repo": "snippets_mbsdk.ipynb"
   },
   "source": [
    "# Cleaning up\n",
    "\n",
    "To clean up all GCP resources used in this project, you can [delete the GCP\n",
    "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n",
    "\n",
    "Otherwise, you can delete the individual resources you created in this tutorial:\n",
    "\n",
    "- Dataset\n",
    "- Pipeline\n",
    "- Model\n",
    "- Endpoint\n",
    "- Batch Job\n",
    "- Custom Job\n",
    "- Hyperparameter Tuning Job\n",
    "- Cloud Storage Bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "cleanup:mbsdk",
    "repo": "snippets_mbsdk.ipynb"
   },
   "outputs": [],
   "source": [
    "delete_dataset = True\n",
    "delete_pipeline = True\n",
    "delete_model = True\n",
    "delete_endpoint = True\n",
    "delete_batchjob = True\n",
    "delete_customjob = True\n",
    "delete_hptjob = True\n",
    "delete_bucket = True\n",
    "\n",
    "\n",
    "# Delete the dataset using the Vertex dataset object\n",
    "try:\n",
    "    if delete_dataset and 'dataset' in globals():\n",
    "        dataset.delete()\n",
    "except Exception as e:\n",
    "    print(e)\n",
    "\n",
    "# Delete the model using the Vertex model object\n",
    "try:\n",
    "    if delete_model and 'model' in globals():\n",
    "        model.delete()\n",
    "except Exception as e:\n",
    "    print(e)\n",
    "\n",
    "# Delete the endpoint using the Vertex endpoint object\n",
    "try:\n",
    "    if delete_endpoint and 'model' in globals():\n",
    "        endpoint.delete()\n",
    "except Exception as e:\n",
    "    print(e)\n",
    "\n",
    "# Delete the batch prediction job using the Vertex batch prediction object\n",
    "try:\n",
    "    if delete_batchjob and 'model' in globals():\n",
    "        batch_predict_job.delete()\n",
    "except Exception as e:\n",
    "    print(e)\n",
    "\n",
    "if delete_bucket and 'BUCKET_NAME' in globals():\n",
    "    ! gsutil rm -r $BUCKET_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "vars": {
   "AIP": "AI Platform",
   "AUTOML": "AutoML",
   "AUTOML_URL": "https://cloud.google.com/vertex-ai/docs/start/automl-users",
   "BATCH_OR_ONLINE": "batch",
   "BQ": "BigQuery",
   "COLAB": "Colab",
   "DATASET_ALIAS": "bank",
   "DATASET_NAME": "Bank Marketing",
   "DATA_TYPE": "tabular",
   "DOCKER": "Docker",
   "EDGE": "Edge",
   "EMAIL": "cdpe@gmail.com",
   "GAPIC": "client library",
   "GAPICP": "SDK",
   "GCP": "Google Cloud",
   "GCS": "Cloud Storage",
   "GNOTEBOOK": "Google Cloud Notebook",
   "HEIGHT": "128",
   "IMPORT_FORMAT": "csv",
   "MILLIHOURS": "1000",
   "MODEL_TYPE": "tabular binary classification",
   "NOTEBOOK": "sdk_automl_tabular_binary_classification_batch.ipynb",
   "REGION": "us-central1",
   "REPO": "ai-platform-unified/notebooks/community/gapic/custom",
   "TENSORFLOW": "TensorFlow",
   "TFLite": "TFLite",
   "TFServing": "TF Serving",
   "TFV": "2-1",
   "TITLE": "AutoML training tabular binary classification model for batch prediction",
   "TRAINING": "training",
   "TRAINING_TIME": "20",
   "TRAINING_WAIT": "60",
   "WIDTH": "128",
   "YEAR": "2021",
   "null": "null",
   "uCAIP": "Vertex"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
